(57) Abstract: An algorithm for video summarization is described. The algorithm combines photometric and motion information.
According to the algorithm, the correspondence between feature points is used to detect shot boundaries and to select key frames.
Thus, the rate of feature points that are lost or initiated is used as an indication of whether a shot transition has occurred. Key frames
are selected as frames where the activity change is low.
A method and a system for generating summarized video
TECHNICAL FIELD 

The present invention relates to a method and a system for video 
summarization, and in particular to a method and a system for key 
frame extraction and shot boundary detection. 

BACKGROUND OF THE INVENTION AND PRIOR ART 

Recent developments in personal computing and communications have
created new classes of devices, such as hand-held computers,
personal digital assistants (PDAs), smart phones, automotive
computing devices, and computers, that allow users more access to
information.

Many of the device manufacturers, including cell phone, PDA, and 
hand-held computer manufacturers, are working to grow the 
functionalities of their devices. The devices are being given 
capabilities of serving as calendar tools, address books, paging 
devices, global positioning devices, travel and mapping tools, 
email clients, and Web browsers. As a result, many new businesses 
are forming around applications related to bringing all kinds of 
information to these devices. However, due to the limited 
capabilities of many of these devices, in terms of the display 
size, storage, processing power, and network access, there are new 
challenges for designing the applications that allow these devices 
to access, store and process information. 

Concurrent with these developments, recent advances in storage,
acquisition, and networking technologies have resulted in large
amounts of rich multimedia content. As a result, there is a
growing mismatch between the rich content that is available and 
the capabilities of the client devices to access and process it. 

In this respect, so-called key-frame based video summarization is
an efficient way to manage and transmit video information. This
representation can be used within the MPEG-7 application Universal
Multimedia Access, as described in C. Christopoulos et al., "MPEG-7
application: Universal access through content repurposing and
media conversion", Seoul, Korea, March 1999,
ISO/IEC/JTC1/SC29/WG11 M4433, in order to adapt video data to the
client devices.

For audio-visual (AV) material, key frame extraction can be used
to adapt to the bandwidth and computational capabilities of
the clients. For example, low-bandwidth or low-capability clients
might request only the audio information to be delivered, or only
the audio combined with some key frames. High-bandwidth,
computationally capable clients can request the whole AV material.
Another application is fast browsing of digital video. Skipping
video frames at fixed intervals reduces the video viewing time.
However, this merely gives a random sample of the overall video.

Below the following definitions will be used: 

Shot 

A shot is defined as a sequence of frames captured by one camera
in a single continuous action in time and space, see also J.
Monaco, "How to read a film", Oxford Press, 1981.

Shot boundary 

There are a number of different types of boundaries between shots.
A cut is an abrupt shot change that occurs in a single frame. A
fade is a gradual change in brightness either ending in a black
frame (fade-out) or starting with one (fade-in). A dissolve occurs
when the images of the first shot become dimmer and the images of
the second shot become brighter, with frames within the transition
showing one image superimposed on the other one. A wipe occurs
when pixels from the second shot replace those of the first shot
in a regular pattern, such as a line from the left edge of the
frames.

Key frame 

Key frames are defined inside every shot. They represent, with a
small number of frames, the most relevant information content of
the shot according to some subjective or objective measurement.

Conventional video summarization consists of two steps:

1. Shot boundary detection.

2. Key-frame extraction. 

Many attributes of the frames, such as colour, motion and shape,
have been used for video summarization. The standard algorithm for
shot boundary detection in video summarization is based on
histograms. Histogram-based techniques have been shown to be robust
and effective, as described in A. Smeulders and R. Jain, "Image
databases and Multi-Media search", Singapore, 1988, and in J. S.
Boreczky and L. A. Rowe, "Comparison of Video Shot Boundary
Detection Techniques", Storage and Retrieval for Image and Video
Databases IV, Proc. of IS&T/SPIE 1996 Int'l Symp. on Elec.
Imaging: Science and Technology, San Jose, CA, February 1996.

Thus, the colour histograms of two images are computed. If the 
Euclidean distance between the two histograms is above a certain 
threshold, a shot boundary is assumed. However, no information 
about motion is used. Therefore, this technique has drawbacks in 
scenes with camera and object motion. 
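The histogram comparison described above can be sketched as follows. This is a minimal illustration, not the method of the cited references: the representation of a frame as a flat sequence of (r, g, b) tuples, the bin count, the threshold value and all function names are assumptions made for the example.

```python
import math
from collections import Counter

def colour_histogram(pixels, bins=8):
    """Normalised joint RGB histogram of a sequence of (r, g, b) values in 0..255.

    bins is assumed to divide 256 evenly.
    """
    step = 256 // bins
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {cell: n / total for cell, n in counts.items()}

def histogram_distance(hist_a, hist_b):
    """Euclidean distance between two sparse histograms."""
    cells = set(hist_a) | set(hist_b)
    return math.sqrt(sum((hist_a.get(c, 0.0) - hist_b.get(c, 0.0)) ** 2 for c in cells))

def histogram_shot_boundary(pixels_a, pixels_b, threshold=0.5, bins=8):
    """Assume a shot boundary when the histogram distance exceeds the threshold."""
    distance = histogram_distance(colour_histogram(pixels_a, bins),
                                  colour_histogram(pixels_b, bins))
    return distance > threshold
```

Because the histogram discards pixel positions, the comparison carries no motion information, which is the limitation noted above.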

Furthermore, key frames must be extracted from the different shots
in order to provide a video summary. Conventional key frame
extraction algorithms are for example described in: Wayne Wolf,
"Key frame selection by motion analysis", in Proceedings, ICASSP
96, wherein the optical flow is used in order to identify local
minima of motion in a shot. These local minima of motion are then
determined to correspond to key frames. In W. Xiong, J. C. M.
Lee, and R. H. Ma, "Automatic video data structuring through shot
partitioning and key-frame selection", Machine Vision and
Applications, vol. 10, no. 2, pp. 51-65, 1997, a seek-and-spread
algorithm is used where the previous key-frame is used as a
reference for the extraction of the next key-frame. Also, in R. L.
Lagendijk, A. Hanjalic, M. Ceccarelli, M. Soletic, and E.
Persoon, "Visual search in a SMASH system", Proceedings of IEEE
ICIP 97, pp. 671-674, 1997, a cumulative action measure of shots
is used in order to compute the number and the position of
key-frames allocated to each shot. The action between two frames is
computed via a histogram difference. One advantage of this method
is that the number of key-frames can be pre-specified.

SUMMARY 

It is an object of the present invention to provide a method and a 
system for shot boundary detection and key frame extraction, which 
can be used for video summarization and which is robust against 
camera and object motion. 

This object and others are obtained by a method and a system for 
key frame extraction, where a list of feature points is created. 
The list keeps track of individual feature points between 
consecutive frames of a video sequence. 

In the case when many new feature points are entered on the list,
or when many feature points are removed from the list, between two
consecutive frames, a shot boundary is determined to have occurred.
A key frame is then selected between two shot boundaries as a frame
at which no or few feature points are entered on or removed from
the list of feature points.

By using such a method for extracting key frames from a video
sequence, motion in the picture and/or camera motion can be taken
into account. The key frame extraction algorithm will therefore be
more robust against camera motion.

BRIEF DESCRIPTION OF THE DRAWINGS 

The present invention will now be described in more detail and 
with reference to the accompanying drawings, in which: 

- Figs. 1a and 1b are flow charts illustrating an algorithm for
shot boundary detection.

- Fig. 2 is a block diagram illustrating the basic blocks of an 
apparatus for tracking feature points in consecutive video frames. 

- Fig. 3 is a diagram illustrating the activity change within a 
shot. 

- Fig. 4 shows a set of consecutive frames with detected feature 
points. 

DETAILED DESCRIPTION

In Figs. 1a and 1b, flow charts illustrating the steps carried out
during one iteration in an algorithm for shot boundary detection
according to a first preferred embodiment are shown.

Thus, with reference to Fig. 1a, first in a block 101 a first
frame is input and the feature points of the first frame are
extracted, and used as input in order to predict the feature
points of the next frame. Next, in a block 103, a prediction of
the feature points for the next frame is calculated. Thereupon, in
a block 105 the next frame is input, and the feature points of the
next frame are extracted in a block 107 using the same feature
point extraction algorithm as in block 101.

Many algorithms have been described for extracting such feature
points, which could correspond to corner points. For example, B.
Lucas and T. Kanade, "An iterative image registration technique
with an application to stereo vision", in Proc. 7th Int. Joint
Conf. on Artificial Intelligence, 1981, pp. 674-679, describes one
such method. Also, the method as described in S. K. Bhattacharjee,
"Detection of feature points using an end-stopped wavelet",
submitted to IEEE Trans. on Image Processing 1999, can be used.

Next, in a block 109, a data association between estimated feature
points and feature points extracted in block 107 is performed. An
update of the list of feature points is then performed in a block
111. Thereupon, an update of the estimate for each feature point
on the list of feature points is performed in a block 113. Finally,
the algorithm returns to block 103 and the next frame is input in
the block 105 in order to perform a data association between the
current estimated feature points and the feature points of the
next frame.

Each time the algorithm in Fig. 1a updates the list of feature
points in the block 111, it is checked whether a shot boundary has
occurred. This shot boundary detection procedure is illustrated in
Fig. 1b. Thus, first in a block 131, the updated list is input. A
comparison between the current list of feature points and the list
of previous feature points is then performed in a block 133.

If the number of lost feature points from the previous list of
feature points or the number of new feature points in the
current list of feature points is larger than a pre-set threshold
value, the procedure proceeds to a block 135, where the current
frame is indicated as a shot boundary.

The procedure then returns to the block 131. If, on the other
hand, it is decided in the block 133 that the current frame does
not correspond to a shot boundary, the procedure returns directly
to the block 131.

In Figure 2, a block diagram of one iteration of an algorithm for
key frame extraction using the shot boundary detection procedure
as described in conjunction with Figs. 1a and 1b is shown. A
frame at time k is represented by a set of P feature points
$x_n(k)$, $n = 1, 2, \ldots, P$, which can consist of:

* Kinematic components: position $(x, y)$ and velocity $(\dot{x}, \dot{y})$.

* Photometric components, such as Gabor responses $(f_1, f_2, \ldots)$.

Here n denotes a particular feature point at time k (or frame k),
and the number of feature points P is a function of time.

Photometric components are in general filter responses such as
Gabor responses or Gaussian-derivative responses, computed by
using the image intensities as input, see J. Malik and P. Perona,
"Preattentive texture discrimination with early vision
mechanisms", J. Opt. Soc. Am., vol. 7, no. 5, pp. 923-932, May
1990. The use of photometric components in the algorithm as
described herein will improve the scaling and rotation sensitivity
in extraction of the feature points, but is optional.

The feature vector $x_n(k) = (x, y, \dot{x}, \dot{y}, f_1, f_2, \ldots)$ is also
referred to as the state vector. Its components summarize the
current and past history of the feature point n in order to
predict its future trajectory.

Feature points correspond to points that contain a significant
amount of texture, such as corner points. Such points are
relatively easy to track.

Referring to Figure 2, first, in a block 201 at a feature point
extraction stage, the vector $z_n(k+1) = (x, y, f_1, f_2, \ldots)$, denoted
the n:th measurement vector at time k+1, is computed for
$n = 1, 2, \ldots, P$.

Next, in a measurement prediction stage in block 203, the
measurement $\hat{z}_n(k+1)$ is estimated given the predicted state
vector $\hat{x}_n(k)$ of the last frame k. Kalman filtering as
described in A. Gelb, "Applied optimal estimation", MIT Press,
1974, can be used as the estimation algorithm.
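The prediction and correction steps of such an estimator can be sketched for a single coordinate of one feature point. The sketch assumes a constant-velocity motion model with a frame interval of one, independent filtering of the x and y axes, and illustrative noise values q and r; none of these choices are specified in the description, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AxisState:
    pos: float          # position along one axis (x or y)
    vel: float          # velocity along the same axis
    p00: float = 1.0    # position variance
    p01: float = 0.0    # position/velocity covariance
    p11: float = 1.0    # velocity variance

def predict(s, q=0.01):
    """Constant-velocity Kalman prediction over one frame interval (dt = 1)."""
    return AxisState(
        pos=s.pos + s.vel,
        vel=s.vel,
        p00=s.p00 + 2.0 * s.p01 + s.p11 + q,
        p01=s.p01 + s.p11,
        p11=s.p11 + q,
    )

def update(s, z, r=1.0):
    """Correct a predicted state with the measured position z."""
    innovation = z - s.pos
    innovation_var = s.p00 + r
    k0 = s.p00 / innovation_var   # Kalman gain for position
    k1 = s.p01 / innovation_var   # Kalman gain for velocity
    return AxisState(
        pos=s.pos + k0 * innovation,
        vel=s.vel + k1 * innovation,
        p00=(1.0 - k0) * s.p00,
        p01=(1.0 - k0) * s.p01,
        p11=s.p11 - k1 * s.p01,
    )
```

In this sketch the predicted measurement for the next frame is simply the `pos` field returned by `predict`; photometric components, if used, could be carried along unchanged.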

Next, in a block 205, the correspondence between the predicted
measurements $\hat{z}_n(k+1)$ and the extracted measurements
$z_n(k+1)$ is established, followed by an update of the list of
feature points. $Z_n(k+1) = \{z_n(1), z_n(2), \ldots, z_n(k+1)\}$
represents the n:th list of feature points up to time k+1. The
Nearest Neighbour filter as described in Y. Bar-Shalom and T. E.
Fortmann, "Tracking and data association", Academic Press, 1988,
can be used for data association in order to update the list of
feature points. The estimated measurement vectors
$\hat{z}_n(k+1)$, the list of feature points $Z_n(k)$ from the last
frame k and the measurement vectors $z_n(k+1)$ from the current
frame k+1 are used as inputs for the data association step. It is
important to note that the number P of feature points may vary in
time. This is due to the fact that each data association cycle may
include initiation of feature points, termination of feature
points as well as maintenance of feature points.
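A data association cycle of this kind can be sketched with a greedy nearest-neighbour match on position only. The gating distance `gate`, the dictionary-based track representation and the function name are assumptions for illustration; the cited Nearest Neighbour filter may differ in detail.

```python
import math

def associate(predicted, measured, gate=5.0):
    """Greedy nearest-neighbour association on (x, y) positions.

    predicted: dict mapping a feature point id to its predicted (x, y).
    measured:  list of extracted (x, y) positions in the new frame.
    Returns (maintained, terminated, initiated):
      maintained: dict id -> index of the matched measurement,
      terminated: ids with no measurement inside the gate,
      initiated:  indices of measurements not matched to any id.
    """
    pairs = sorted(
        (math.dist(p, m), point_id, j)
        for point_id, p in predicted.items()
        for j, m in enumerate(measured)
    )
    maintained, used = {}, set()
    for distance, point_id, j in pairs:
        if distance > gate:
            break  # all remaining pairs are farther apart than the gate
        if point_id in maintained or j in used:
            continue
        maintained[point_id] = j
        used.add(j)
    terminated = [point_id for point_id in predicted if point_id not in maintained]
    initiated = [j for j in range(len(measured)) if j not in used]
    return maintained, terminated, initiated
```

The lengths of the `terminated` and `initiated` lists are the counts of lost and new feature points for the frame.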

Definitions of the different types of processing of feature
points are given below.

1. Feature point initiation: Creation of new feature points as new
feature points are extracted. 

2. Feature point termination: Removal of a feature point when the 
feature point is no longer extracted. 

3. Feature point maintenance: Update of a feature point when the 
corresponding feature point is extracted. 

Finally, when many feature points are terminated (for instance in
cut, fade-out, dissolve, or wipe situations) or initiated (for
instance in cut, fade-in, dissolve, or wipe situations) at the
same time, the frame is determined to correspond to a shot
boundary.

Furthermore, an activity measure for the rate of change in feature
points can be defined in order to detect shot boundaries. Such a
measure will in the following be termed activity change. This
activity measure depends on the number of terminated or
initiated feature points between consecutive frames. The measure
can, for example, be defined as the maximum of the percentages of
terminated and initiated feature points. The
percentage of initiated feature points is the number of new
feature points divided by the total number of feature points in
the current frame. The percentage of terminated feature points is
the number of removed feature points divided by the total number
of feature points in the previous frame.

A suitable threshold value is set, and if the maximum of the
terminated and initiated percentages is above the threshold
value, a shot boundary is determined to have occurred. Other
definitions of activity change are of course also possible.
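The activity change defined above can be written out directly. The function names, the default threshold of 50 percent and the handling of frames without feature points are assumptions of this sketch; the description leaves the threshold value open.

```python
def activity_change(n_terminated, n_initiated, n_previous, n_current):
    """Maximum of the terminated and initiated percentages.

    n_terminated: feature points removed between the two frames.
    n_initiated:  feature points newly created in the current frame.
    n_previous:   total feature points in the previous frame.
    n_current:    total feature points in the current frame.
    A frame with no feature points contributes 0 percent (assumption).
    """
    pct_terminated = 100.0 * n_terminated / n_previous if n_previous else 0.0
    pct_initiated = 100.0 * n_initiated / n_current if n_current else 0.0
    return max(pct_terminated, pct_initiated)

def detect_boundary(activity, threshold=50.0):
    """A shot boundary is assumed when the activity change exceeds the threshold."""
    return activity > threshold
```

For example, a frame pair with 100 points before, 5 terminated and 10 initiated gives an activity change just under 10 percent, well below the illustrative threshold.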

In Figure 4, the detected feature points in a set of consecutive
frames k (537), k+1 (540), k+2 (541), k+3 (542) are shown. In
frame k+1 (540) most of the feature points from frame k (537) are
detected. Meanwhile, a few points ceased to exist and a small
number of points appeared for the first time. At frame k+3 (542)
most of the feature points are lost. Therefore this frame is
determined to correspond to a shot boundary (cut).

Experiments show that a shot consists of a set of successive
stationary states with the most important information content.
The transition between two states corresponds to a peak in the
activity change, as can be seen in Figure 3. In Fig. 3, the
activity change as a function of time (or frames) is shown. The
stationary states, i.e. flat parts with low activity change, are
detected and used to extract the key-frames.

With reference again to Fig. 4, in frame k+1 (540) most of the
feature points from frame k (537) are detected. Meanwhile, a few
points ceased to exist and a small number of points appeared for
the first time. Therefore, the frame k+1 can be a suitable key
frame.

Thus, once the shot boundaries are determined using the algorithm
as described above, one or several of the local minima between the
shot boundaries are extracted as key-frames. The local minima have
been shown to occur where the activity change is constant. It is
therefore not necessary to extract the frame corresponding to a
local minimum per se; any frame where the activity change is
constant provides a good result. However, frames corresponding to
the local minima in activity change between shot boundaries should
provide the best result.
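Key-frame selection between shot boundaries can be sketched as follows. For brevity the sketch returns a single frame per shot, the one with the lowest activity change, whereas the description allows one or several local minima; the function name and the list-based inputs are assumptions.

```python
def select_key_frames(activity, boundaries):
    """Pick one key frame per shot: the frame with the lowest activity change.

    activity:   list where activity[i] is the activity change at frame i.
    boundaries: frame indices detected as shot boundaries; each boundary
                frame is treated as the first frame of a new shot.
    Returns a list of key-frame indices, one per non-empty shot.
    """
    edges = [0] + sorted(boundaries) + [len(activity)]
    key_frames = []
    for start, end in zip(edges, edges[1:]):
        if end > start:
            # min() returns the earliest frame when several share the minimum
            key_frames.append(min(range(start, end), key=activity.__getitem__))
    return key_frames
```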

For example, film directors use camera motion (panning,
zooming) to show the connection between two events. Imagine a shot
where two actors A and B are speaking to each other in front of a
stable background. When actor A is speaking, the camera focuses
on him. This corresponds to low activity over time (no major
change of extracted feature points). When actor B starts to speak,
the camera pans to him. This panning corresponds to high activity
over the corresponding frames. Then, as the camera comes to rest
on actor B, the activity level falls to a low value again. Key
frames are selected from the low-activity frames, i.e. the flat
parts in Figure 3.

The use of compressed video will make the algorithm faster.
However, the information available in the compressed
domain for performing multi-target tracking is limited. A
compromise can be to decode only the I-frames of the video
sequence. The I-frames are then used for the video summarization
algorithm as described herein.

This choice is motivated by three factors. First, I-frames occur
frequently, e.g. every 12 frames. This frame sub-sampling is
acceptable since a shot lasts on average 5 to 23 seconds, see for
example D. Colla, and G. Ghoma, "Image activity characteristics in
broadcast television", IEEE Trans. Communications, vol. 26, pp.
1201-1206, 1976. Second, the algorithm as described herein is able
to deal with large motion between two successive frames, thanks to
the use of Kalman filtering. Third, I-frames, which can be
JPEG-coded or coded in another still image format, are accessible
independently of other frames in the video sequence (such as B-
and P-frames).
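The first factor can be checked with a short calculation. The frame rate of 25 frames per second is an assumption made for this example only; the description gives the I-frame interval (every 12 frames) and the shot durations (5 to 23 seconds) but no frame rate.

```python
def i_frames_per_shot(shot_seconds, frame_rate=25.0, i_frame_interval=12):
    """Approximate number of I-frames that fall inside a shot of the given length."""
    return int(shot_seconds * frame_rate) // i_frame_interval

shortest = i_frames_per_shot(5)    # 125 frames -> 10 I-frames
longest = i_frames_per_shot(23)    # 575 frames -> 47 I-frames
```

Under this assumption even the shortest average shot still contains about ten I-frames, so tracking on I-frames alone leaves several samples per shot.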

CLAIMS

1. A method of extracting key frames from a video signal, 
characterized by the steps of: 

- extracting feature points from frames in the video signal, 

- tracking feature points between consecutive frames, 

- measuring the number of new or lost feature points between 
consecutive frames, 

- determining shot boundaries in the video signal when the number 
of new or lost feature points is above a certain threshold 
value, and 

- selecting as a key frame, a frame located between two shot
boundaries where the number of new or lost feature points
matches a certain criterion.

2. A method according to claim 1, characterized in that the 
threshold value is defined as the maximum between terminated and 
initiated feature points calculated as a percentage, where the 
percentage of initiated feature points is the number of new 
feature points divided by the total number of feature points in 
the current frame, and the percentage of terminated feature points 
is the number of removed feature points divided by the total 
number of feature points in the previous frame. 

3. A method according to any of claims 1-2, characterized in 
that the key frame is selected as a frame where the number of new 
or lost feature points is constant for a number of consecutive 
frames in the video signal. 

4. A method according to any of claims 1-2, characterized in
that the key frame is selected as a frame where the number of new
or lost feature points corresponds to a local minimum between two
shot boundaries or where the number is below a certain pre-set
threshold value.



5. A method according to any of claims 1-4, when the video
signal is a compressed video signal comprising I-frames,
characterized in that only the I-frames are decoded and used as
input frames for determining shot boundaries and selecting key
frames.

6. A method according to any of claims 1-5, characterized in 
that feature points in the frames of the video signal are 
extracted using both kinematic components and photometric 
components of the video signal. 

7. A method of shot boundary detection in a video signal, 
characterized by the steps of: 

- extracting feature points from frames in the video signal, 

- tracking feature points between consecutive frames, 

- measuring the number of new or lost feature points between 
consecutive frames, 

- determining shot boundaries in the video signal when the number 
of new or lost feature points is above a certain threshold 
value. 

8. A method according to claim 7, characterized in that the 
threshold value is defined as the maximum between terminated and 
initiated feature points calculated as a percentage, where the 
percentage of initiated feature points is the number of new 
feature points divided by the total number of feature points in 
the current frame, and the percentage of terminated feature points 
is the number of removed feature points divided by the total 
number of feature points in the previous frame. 

9. A method according to any of claims 7-8, characterized in
that feature points in the frames of the video signal are
extracted using both kinematic components and photometric
components.

10. A method according to any of claims 7-9, when the video 
signal is a compressed video signal comprising I-frames, 
characterized in that only the I-frames are decoded and used as 
input frames for determining shot boundaries and selecting key
frames.

11. An apparatus for extracting key frames from a video signal,
characterized by 

- means for measuring the number of new or lost feature points 
between consecutive frames, 

- means for determining shot boundaries in the video signal when 
the number of new or lost feature points is above a certain 
threshold value, and 

- means for selecting as a key frame, a frame located between two
shot boundaries where the number of new or lost feature points
matches a certain criterion.

12. An apparatus according to claim 11, characterized in that the 
threshold value is defined as the maximum between terminated and 
initiated feature points calculated as a percentage, where the 
percentage of initiated feature points is the number of new 
feature points divided by the total number of feature points in 
the current frame, and the percentage of terminated feature points 
is the number of removed feature points divided by the total 
number of feature points in the previous frame. 

13. An apparatus according to any of claims 11 - 12, characterized 
by means for selecting the key frame as a frame where the number 
of new or lost feature points is constant for a number of 
consecutive frames in the video signal. 

14. An apparatus according to any of claims 11 - 12, characterized 
by means for selecting the key frame as a frame where the number 
of new or lost feature points corresponds to a local minimum
between two shot boundaries or where the number is below a certain 
pre-set threshold value. 

15. An apparatus according to any of claims 11 - 14, when the 
video signal is a compressed video signal comprising I-frames, 
characterized by means for only decoding the I-frames and using 
the I-frames as input frames for determining shot boundaries and 
selecting key frames. 

16. An apparatus according to any of claims 11 - 15, characterized
by means for extracting feature points in the frames of the video 
signal using both kinematic components and photometric components 
of the video signal. 

17. An apparatus for shot boundary detection in a video signal, 
characterized by 

- means for measuring the number of new or lost feature points 
between consecutive frames, and 

- means for determining shot boundaries in the video signal when 
the number of new or lost feature points is above a certain 
threshold value. 

18. An apparatus according to claim 17, characterized in that the 
threshold value is defined as the maximum between terminated and 
initiated feature points calculated as a percentage, where the 
percentage of initiated feature points is the number of new 
feature points divided by the total number of feature points in 
the current frame, and the percentage of terminated feature points 
is the number of removed feature points divided by the total 
number of feature points in the previous frame.

19. An apparatus according to any of claims 17 - 18, characterized
by means for extracting feature points in the frames of the video 
signal using both kinematic components and photometric components 
of the video signal. 

20. An apparatus according to any of claims 17 - 19, when the 
video signal is a compressed video signal comprising I-frames, 
characterized by means for only decoding the I-frames and using 
the decoded I-frames as input frames for determining shot
boundaries.

21. A system for video summarization comprising an apparatus 
according to any of claims 11 - 20. 